Bank Customer Churn Analysis Using Support Vector Machine

Author

Gopi Shankar Reddy Mallu, Kavya Reddy Maale, Satya Nageswara Dinesh Donkada, Vamsi Krishna Kalla

Published

April 15, 2024

Introduction

Support Vector Machines (SVMs) are a type of supervised learning algorithm that can be used for classification or regression tasks. The main idea behind SVMs is to find a hyperplane that maximally separates the different classes in the training data. This is done by finding the hyperplane that has the largest margin, which is defined as the distance between the hyperplane and the closest data points from each class. Once the hyperplane is determined, new data can be classified by determining on which side of the hyperplane it falls. SVMs are particularly useful when the data has many features, and/or when there is a clear margin of separation in the data.

Fig: - Linear & Non-Linear Separable Data

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

Fig: - Hyperplane in 2D & 3D feature space

Methods

Mathematical Intuition of Support Vector Machine

Consider a binary classification task where there are two classes, denoted by the labels +1 and -1. The input feature vectors (X) and the matching class labels (Y) comprise our training dataset.

The equation of the hyperplane can be written as:

\[ w^Tx+b=0 \tag{1}\]

The vector \(w\) is the normal vector to the hyperplane, i.e. the direction perpendicular to it. The parameter \(b\) is the bias, which sets the hyperplane's offset from the origin along \(w\): the perpendicular distance from the origin to the hyperplane is \(|b| / \|w\|\).

The signed distance of a data point \(x_i\) from the hyperplane is:

\[ d_i = \frac{w^Tx_i + b}{\|w\|} \tag{2}\]

where \(\|w\|\) is the Euclidean norm of the normal vector \(w\).

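A new point \(x\) is assigned to a class according to the side of the hyperplane on which it falls, using the labels +1 and -1 defined above:

\[ \hat{y} =\begin{cases} +1 & \text{if } w^T x + b \geq 0 \\ -1 & \text{if } w^T x + b < 0 \end{cases} \tag{3}\]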
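As a minimal sketch of the distance and decision rule, the following R snippet evaluates both for a few points, using the +1/-1 labels from the start of this section. The weight vector `w`, bias `b`, and points in `X` are made-up illustrative values, not parameters fitted to the churn data.

```r
# Hypothetical hyperplane parameters (illustrative only, not fitted values)
w <- c(2, -1)
b <- -1

# Three example points, one per row
X <- rbind(c(2, 1), c(0, 1), c(1, 1))

# Raw decision value w^T x + b for each row
decision <- as.vector(X %*% w + b)

# Signed distance to the hyperplane: (w^T x + b) / ||w||
signed_distance <- decision / sqrt(sum(w^2))

# Predicted label: +1 on or above the hyperplane, -1 below it
y_hat <- ifelse(decision >= 0, 1, -1)

print(data.frame(decision, signed_distance, y_hat))
```

The third point lies exactly on the hyperplane (decision value 0), so it is assigned to the +1 class by the "greater than or equal to" convention.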

Kernel Function in SVM

In Support Vector Machines (SVM), the kernel function plays a crucial role in transforming the input feature space into a higher-dimensional space where the data can be linearly separated. This is particularly useful in cases where the data is not linearly separable in its original space. The kernel function computes the dot product between the feature vectors in this higher-dimensional space without explicitly mapping the vectors into that space, which is known as the “kernel trick.”

Common types of kernel functions include:

  • Linear Kernel: \(K(x_i, x_j) = x_i^T x_j\). This is the simplest form of the kernel, used when the data is linearly separable.

  • Polynomial Kernel: \(K(x_i, x_j) = (1 + x_i^T x_j)^d\). This kernel maps the input features into a polynomial feature space of degree \(d\), allowing for polynomial decision boundaries.

  • Radial Basis Function (RBF) Kernel: \(K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)\). Also known as the Gaussian kernel, it corresponds to an infinite-dimensional feature space, providing a lot of flexibility for non-linear decision boundaries.

Each kernel function has its own set of parameters that need to be tuned for optimal performance. The choice of kernel function and its parameters can significantly impact the SVM model’s ability to capture the underlying patterns in the data.
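To make the kernel definitions concrete, here is a small base-R sketch of the three kernels written in the standard two-argument form, each taking a pair of feature vectors. The function names, the example vectors, and the default values of `d` and `gamma` are illustrative assumptions rather than settings used in the analysis.

```r
# Linear kernel: plain dot product of the two vectors
linear_kernel <- function(x, z) sum(x * z)

# Polynomial kernel of degree d with a constant offset of 1
poly_kernel <- function(x, z, d = 2) (1 + sum(x * z))^d

# RBF (Gaussian) kernel with width parameter gamma
rbf_kernel <- function(x, z, gamma = 0.5) exp(-gamma * sum((x - z)^2))

# Two example feature vectors (made-up values)
x <- c(1, 2)
z <- c(3, 0)

linear_kernel(x, z)  # 1*3 + 2*0 = 3
poly_kernel(x, z)    # (1 + 3)^2 = 16
rbf_kernel(x, z)     # exp(-0.5 * (4 + 4)) = exp(-4)
```

Each function returns a single similarity score, which is all the SVM needs; the higher-dimensional mapping itself is never computed.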

Margin and Support Vectors

The margin in SVM is defined as the distance between the separating hyperplane and the nearest data points from each class, known as the support vectors. The goal of SVM is to find the hyperplane that maximizes this margin, as a larger margin is associated with better generalization ability of the model.

Support vectors are the data points that lie closest to the decision boundary and are critical in defining the position and orientation of the hyperplane. These are the points that directly influence the shape of the decision boundary, as any small change in their position can alter the hyperplane. The SVM model is said to be “sparse” because only the support vectors contribute to defining the hyperplane, while other data points have no influence.
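The idea of a margin and its support vectors can be illustrated with a toy example in base R. The data and the hyperplane below are hypothetical values chosen so that the hyperplane is the maximum-margin separator for these four points; nothing here is fitted to the churn data.

```r
# Toy, linearly separable data (illustrative values only)
X <- rbind(c(1, 1), c(2, 2), c(3, 3), c(4, 4))
y <- c(-1, -1, 1, 1)

# Assumed separating hyperplane x1 + x2 - 5 = 0
w <- c(1, 1)
b <- -5

# Every point is on the correct side with y * (w^T x + b) >= 1
decision <- as.vector(X %*% w + b)
all(y * decision >= 1)

# Distance of every point to the hyperplane: |w^T x + b| / ||w||
distances <- abs(decision) / sqrt(sum(w^2))

# The margin is the smallest of these distances
margin <- min(distances)

# Support vectors are the points that sit exactly at that distance
support_idx <- which(distances - margin < 1e-9)

margin       # 1 / sqrt(2), about 0.707
support_idx  # rows 2 and 3
```

Moving rows 1 or 4 further away from the boundary would leave `margin` and `support_idx` unchanged, which is the sparsity property described above.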

Objective Function and Optimization

The objective function that SVM optimizes is a combination of maximizing the margin and minimizing the classification error. This is achieved through the minimization of the following objective function:

\[ \min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \]

Subject to the constraints:

\[ y_i (w^T x_i + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0 \quad \text{for all } i \]

where \(w\) is the weight vector, \(b\) is the bias term, \(C\) is the regularization parameter, \(\xi_i\) are the slack variables representing the degree of misclassification of the \(i\)-th data point, and \(y_i\) are the class labels.

The hinge loss function is used in SVM to penalize misclassifications. It is defined as:

Hinge loss = \(\max(0, 1 - y_i (w^T x_i + b))\)

The hinge loss is zero for correctly classified points that are outside the margin, and it increases linearly for points that are on the wrong side of the hyperplane or within the margin.

The optimization of the objective function involves finding the values of \(w\) and \(b\) that minimize the function, subject to the constraints. This is typically done using quadratic programming techniques.
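As a rough illustration of the objective, the sketch below evaluates the hinge loss and the soft-margin objective for a fixed, hand-picked `w` and `b`. It relies on the fact that, for given `w` and `b`, the smallest feasible slack \(\xi_i\) equals the hinge loss of point \(i\). The helper names and all numeric values are assumptions for illustration; no optimization is performed.

```r
# Hinge loss per observation: max(0, 1 - y_i * (w^T x_i + b))
hinge_loss <- function(w, b, X, y) {
  pmax(0, 1 - y * as.vector(X %*% w + b))
}

# Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses
svm_objective <- function(w, b, X, y, C = 1) {
  0.5 * sum(w^2) + C * sum(hinge_loss(w, b, X, y))
}

# Made-up data: the third point is on the wrong side of the hyperplane
X <- rbind(c(2, 2), c(0, 0), c(1, 0.5))
y <- c(1, -1, 1)
w <- c(1, 1)
b <- -2

hinge_loss(w, b, X, y)     # 0, 0, 1.5
svm_objective(w, b, X, y)  # 0.5 * 2 + 1 * 1.5 = 2.5
```

Raising `C` makes the misclassified third point more expensive relative to the margin term, which is how the regularization parameter trades margin width against training errors.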

Analysis and Results

Dataset

Customer retention is a critical aspect for banks to ensure the sustainability of their operations. ABC Multinational Bank, in particular, places a strong emphasis on retaining its account holders. The primary objective of this analysis is to examine the customer data of the bank’s account holders to predict and prevent customer churn effectively.

The dataset under consideration contains information about account holders at ABC Multinational Bank, with the ultimate goal of predicting customer churn. The dataset comprises the following columns:

| Column Name | Description |
|---|---|
| customer_id | A unique identifier for each customer, not used in the analysis. |
| credit_score | A numerical representation of the customer’s creditworthiness. |
| country | The country in which the customer resides. |
| gender | The gender of the customer (e.g., male, female). |
| age | The age of the customer in years. |
| tenure | The number of years the customer has been with the bank. |
| balance | The current balance in the customer’s account. |
| products_number | The number of products the customer has with the bank. |
| credit_card | Indicates whether the customer has a credit card with the bank. |
| active_member | Indicates whether the customer is an active member. |
| estimated_salary | The estimated annual salary of the customer. |
| churn | The target variable, indicating customer churn (1 for churned, 0 for not churned). |

Source: - Bank Churn Dataset

Loading Libraries

Code
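library(tidyverse)     # Data manipulation and plotting (also attaches dplyr and ggplot2)
library(dplyr)
library(ggplot2)
#install.packages("corrplot")
library(corrplot)      # Correlation plot
library(caret)         # Data partitioning, model training, confusion matrices
library(smotefamily)
library(ROSE)          # ovun.sample for class balancing
library(DMwR)
library(e1071)
library(pROC)          # ROC curve and AUC
library(doParallel)
library(foreach)
library(randomForest)  # For Random Forest
library(xgboost)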

Load Data

Code
df <- read.csv("dataset/train.csv")
Code
dim(df)
[1] 165034     14

Summary Statistics

Code
summary(select(df, CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary))
  CreditScore         Age            Tenure         Balance      
 Min.   :350.0   Min.   :18.00   Min.   : 0.00   Min.   :     0  
 1st Qu.:597.0   1st Qu.:32.00   1st Qu.: 3.00   1st Qu.:     0  
 Median :659.0   Median :37.00   Median : 5.00   Median :     0  
 Mean   :656.5   Mean   :38.13   Mean   : 5.02   Mean   : 55478  
 3rd Qu.:710.0   3rd Qu.:42.00   3rd Qu.: 7.00   3rd Qu.:119940  
 Max.   :850.0   Max.   :92.00   Max.   :10.00   Max.   :250898  
 NumOfProducts   EstimatedSalary    
 Min.   :1.000   Min.   :    11.58  
 1st Qu.:1.000   1st Qu.: 74637.57  
 Median :2.000   Median :117948.00  
 Mean   :1.554   Mean   :112574.82  
 3rd Qu.:2.000   3rd Qu.:155152.47  
 Max.   :4.000   Max.   :199992.48  

Credit Score:

  • The Credit Score ranges from a minimum of 350 to a maximum of 850.

  • The median Credit Score is 659, indicating that half of the customers have a score below 659 and half have a score above.

  • The mean Credit Score is approximately 656.5, suggesting that the average creditworthiness of customers is in the mid-range.

  • The 1st quartile (25th percentile) is 597, and the 3rd quartile (75th percentile) is 710, indicating that 50% of customers have a Credit Score between 597 and 710.

Age:

  • The Age of customers ranges from 18 to 92 years. The median age is 37 years, meaning half of the customers are younger than 37 and half are older.

  • The mean age is approximately 38.13 years, indicating that the average customer is in their late thirties.

  • The distribution of Age is slightly right-skewed, as the mean is slightly higher than the median.

Tenure:

  • Tenure, or the number of years customers have been with the bank, ranges from 0 to 10 years.

  • The median tenure is 5 years, indicating that half of the customers have been with the bank for less than 5 years and half for more.

  • The mean tenure is approximately 5.02 years, suggesting that the average customer has been with the bank for around 5 years.

Balance:

  • The account Balance ranges from a minimum of 0 to a maximum of 250,898.

  • The median balance is 0, indicating that at least half of the customers have no balance in their account.

  • The mean balance is approximately 55,478, suggesting that while many customers have low or zero balances, some have significant amounts in their accounts.

Number of Products:

  • The Number of Products customers have with the bank ranges from 1 to 4.

  • The median number of products is 2, meaning that half of the customers have 2 or fewer products with the bank.

  • The mean number of products is approximately 1.554, indicating that on average, customers have between 1 and 2 products with the bank.

Estimated Salary:

  • The Estimated Salary ranges from a minimum of 11.58 to a maximum of 199,992.48.

  • The median estimated salary is 117,948, suggesting that half of the customers have an estimated salary below this amount and half above.

  • The mean estimated salary is approximately 112,574.82, indicating that the average estimated salary of customers is around 112k.

Count of Unique Values in Categorical Variables

Code
sapply(df[,c('Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited')], function(x) length(unique(x)))
     Geography         Gender      HasCrCard IsActiveMember         Exited 
             3              2              2              2              2 

Checking null values

Code
colSums(is.na(df))
             id      CustomerId         Surname     CreditScore       Geography 
              0               0               0               0               0 
         Gender             Age          Tenure         Balance   NumOfProducts 
              0               0               0               0               0 
      HasCrCard  IsActiveMember EstimatedSalary          Exited 
              0               0               0               0 

There are no null values in the data.

Distribution of target variable

Code
table(df$Exited)

     0      1 
130113  34921 

The number of customers who exited (34,921, about 21% of the total) is much smaller than the number who did not exit (130,113). The data is therefore imbalanced, which needs to be addressed when building the model.
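The class counts in the table above can be turned into proportions with a couple of lines of base R. This is only a sketch that re-enters the two printed counts by hand; the vector name `counts` is an illustrative choice.

```r
# Class counts copied from the table(df$Exited) output above
counts <- c(not_exited = 130113, exited = 34921)

# Share of each class, in percent
class_share <- round(100 * prop.table(counts), 2)
class_share  # exited is about 21.16%

# Ratio of retained to churned customers, roughly 3.7 to 1
imbalance_ratio <- counts[["not_exited"]] / counts[["exited"]]
imbalance_ratio
```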

Distribution of target variable across Geography.

Code
table(df$Geography, df$Exited)
         
              0     1
  France  78643 15572
  Germany 21492 13114
  Spain   29978  6235

France:

  • A total of 94,215 customers are from France.
  • Out of these, 78,643 customers have not exited the bank (retained), while 15,572 customers have exited (churned).
  • The churn rate for France is approximately 16.53%.

Germany:

  • A total of 34,606 customers are from Germany.
  • Out of these, 21,492 customers have not exited the bank, while 13,114 customers have exited.
  • The churn rate for Germany is approximately 37.90%.

Spain:

  • A total of 36,213 customers are from Spain.
  • Out of these, 29,978 customers have not exited the bank, while 6,235 customers have exited.
  • The churn rate for Spain is approximately 17.22%.

Which Gender Has the Higher Average Credit Score?

Code
aggregate(df$CreditScore, by = list(df$Gender), FUN = mean)
  Group.1        x
1  Female 656.2437
2    Male 656.6169

Observations:

  • The difference in average credit scores between male and female customers is minimal, indicating that gender does not significantly impact creditworthiness in this dataset.

  • Both genders have an average credit score in the mid-650s, which is considered a fair credit score range.

Distribution of Age.

Code
ggplot(df, aes(x = Age)) + geom_histogram(binwidth = 5, fill = "blue", color = "black")

Observations:

  • The largest concentration of customers falls within the 30 to 40-year-old range, indicating that the majority of customers are in their early to mid-career stages.

  • There is a significant drop in frequency as age increases, especially beyond 50 years. This suggests that the customer base skews younger.

  • The distribution is right-skewed, meaning there are fewer older customers (those over 60) compared to younger customers.

  • There is a small number of customers in the youngest age bracket (under 25 years) and the oldest (over 75 years).

Distribution of Estimated Salary:

Code
ggplot(df, aes(x = EstimatedSalary)) + geom_histogram(binwidth = 5, fill = "blue", color = "black")

Observations:

  • The distribution is quite uniform across different salary ranges, with no distinct peaks that would indicate a concentration of individuals around a specific salary bracket.

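  • There are frequent spikes throughout the distribution. Because the histogram uses a bin width of 5 on a scale that spans roughly 200,000, most bins contain very few observations, so the spikes likely mark specific salary values that occur repeatedly in the data. This is consistent with precise salary estimates rather than rounded figures.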

  • The salaries range from very low values close to 0 up to 200,000, indicating a diverse group from potentially different economic backgrounds or job roles.

  • There is no obvious concentration of data points around the lower, middle, or upper salary range, which is unusual for income data, where one typically expects most values to cluster around a typical salary with a long tail toward higher incomes.

Comparing the distribution of account balances between customers who have exited and customers who have not exited.

Code
ggplot(df, aes(x = as.factor(Exited), y = Balance)) + geom_boxplot()

Observations:

  • Balance Distribution:

    • The y-axis represents the balance on customer accounts, which seems to range from 0 to a bit over 250,000.

    • Both boxes have a similar interquartile range (IQR), which is the range between the first quartile (25th percentile) and the third quartile (75th percentile), represented by the height of the boxes. This suggests that the middle 50% of balances are similarly distributed between both groups.

    • The median, indicated by the line within each box, is roughly at the same level for both groups, suggesting that the central tendency of balance is similar regardless of whether the customer has exited or not.

  • Outliers:

    • There are visible outliers for both groups, indicated by the points beyond the whiskers of the box plot. These outliers represent customers with balances significantly higher than the general population of the dataset.

How does the distribution of the number of products vary across different geographical regions?

Code
ggplot(df, aes(x = Geography, fill = as.factor(NumOfProducts))) + geom_bar(position = "dodge")

Observations:

  1. France:

    • France has the highest count of customers using one product, followed closely by those using two products. The number of customers using three and four products is significantly lower.
  2. Germany:

    • Germany shows a similar pattern to France with one and two products being the most common among customers. However, the count for one product is notably lower than in France, whereas the count for two products is slightly higher.
  3. Spain:

    • Spain’s pattern mirrors that of France and Germany, with one product being the most common, followed by two products. Again, three and four products are used by a considerably smaller number of customers.

Scatter plot of Age vs Estimated Salary, checking which age groups and salary ranges have exited the bank.

Code
ggplot(df, aes(x = Age, y = EstimatedSalary, color = as.factor(Exited))) + geom_point()

Observations:

  1. There doesn’t appear to be a clear pattern or correlation between Age and Estimated Salary with customer churn, as the exited and non-exited customers are interspersed throughout the plot without any distinct clustering.

  2. Customers who have exited are spread across all ages and salary levels, but there seems to be a slightly higher concentration of churned customers in the 40 to 50 age range.

Scatter plot of Age vs Credit Score, checking which age groups and Credit Score ranges have exited the bank.

Code
ggplot(df, aes(x = Age, y = CreditScore , color = as.factor(Exited))) + geom_point()

Observations:

  1. There is a wide distribution of Credit Scores across different ages with no clear pattern indicating that Credit Score by itself may not be a strong predictor of customer exit.

  2. Both exited and non-exited customers are found across the entire range of Credit Scores and Age, but there is a noticeable density of exited customers (blue dots) in the middle age range, particularly between ages 40 and 50.

Scatter plot of Estimated Salary vs Credit Score, checking which Estimated Salary and Credit Score ranges have exited the bank.

Code
ggplot(df, aes(x = EstimatedSalary, y = CreditScore , color = as.factor(Exited))) + geom_point()

Observations:

The scatter plot shows no clear correlation between Credit Score and Estimated Salary in predicting customer churn, with both customers who exited and those who did not evenly dispersed across all ranges of Salary and Credit Scores.

Correlation Plot

Code
corr_matrix <- cor(select(df, CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary))
corrplot(corr_matrix, method = "circle")

Observations:

There seems to be a noticeable positive correlation between Age and Balance, and a negative correlation between NumOfProducts and Balance.

Churn Rate by Geography

Code
churn_by_country <- df %>%
  group_by(Geography) %>%
  summarise(
    Total_Customers = n(),
    Churned_Customers = sum(Exited),
    Churn_Rate = (sum(Exited) / n()) * 100
  )
Code
ggplot(churn_by_country, aes(x = Geography, y = Churn_Rate, fill = Geography)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Churn_Rate, 2)), vjust = -0.3) +
  labs(title = "Churn Rate by Country",
       x = "Geography",
       y = "Churn Rate (%)") +
  theme_minimal() +
  theme(legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5))

Code
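# Draw a subsample of about 20,000 rows from the full dataset
set.seed(123)  # For reproducibility of the subsample
sample_size <- 20000
sample_idx <- createDataPartition(df$Exited, p = sample_size / nrow(df), list = FALSE)
df <- df[sample_idx, ]
Code
# Drop identifier columns and encode Gender as 1 (Male) / 0 (Female)
df <- df[, !(names(df) %in% c('id', 'CustomerId', 'Surname'))]
df$Gender <- ifelse(df$Gender == 'Male', 1, 0)
Code
# One-hot encode Geography, then drop the original column
df <- cbind(df, model.matrix(~ Geography - 1, data = df))
df <- df[, !names(df) %in% "Geography"]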
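The encoding step above turns the three-level Geography column into three 0/1 indicator columns. The sketch below shows the same `model.matrix` call on a tiny made-up data frame (`toy` and its values are hypothetical), so the resulting column names and values are easy to inspect.

```r
# Tiny made-up data frame with one categorical and one numeric column
toy <- data.frame(Geography = c("France", "Germany", "Spain", "France"),
                  Age = c(40, 35, 52, 29))

# "- 1" drops the intercept so every level gets its own indicator column
dummies <- model.matrix(~ Geography - 1, data = toy)

# Append the indicators and drop the original text column
encoded <- cbind(toy, dummies)
encoded <- encoded[, names(encoded) != "Geography"]

print(encoded)
```

Each row has exactly one indicator set to 1, which is why the training data later shows the columns GeographyFrance, GeographyGermany, and GeographySpain in place of Geography.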

Split data into train and test

Code
set.seed(123)  # For reproducibility
splitIndex <- createDataPartition(df$Exited, p = 0.8, list = FALSE)

train <- df[splitIndex, ]
test <- df[-splitIndex, ]
test_data <- test[, -which(names(test) == "Exited")]
Code
names(train)
 [1] "CreditScore"      "Gender"           "Age"              "Tenure"          
 [5] "Balance"          "NumOfProducts"    "HasCrCard"        "IsActiveMember"  
 [9] "EstimatedSalary"  "Exited"           "GeographyFrance"  "GeographyGermany"
[13] "GeographySpain"  
Code
dim(train)
[1] 16000    13
Code
table(train$Exited)

    0     1 
12580  3420 
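The 80/20 split keeps the class proportions in the training set close to those of the sampled data. A simple base-R approximation of such a stratified split is sketched below on made-up data; it is not caret's `createDataPartition` implementation, and the names `toy`, `train_toy`, and `test_toy` are illustrative.

```r
set.seed(123)

# Made-up data: 40 retained customers and 10 churners
toy <- data.frame(id = 1:50, Exited = rep(c(0, 1), times = c(40, 10)))

# Within each class, draw 80% of the row indices for training
rows_by_class <- split(seq_len(nrow(toy)), toy$Exited)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

table(train_toy$Exited)  # 32 retained, 8 churned
table(test_toy$Exited)   # 8 retained, 2 churned
```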

Handle Class Imbalance

Code
train_balanced <- ovun.sample(Exited ~ ., data = train, method = "both", N = 16000, p = 0.5)$data
Code
table(train_balanced$Exited)

   0    1 
8149 7851 
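The balancing step combines undersampling of the majority class with oversampling of the minority class. The base-R sketch below illustrates that idea on made-up data; it only loosely mirrors what `ovun.sample(method = "both")` does and should not be read as ROSE's actual algorithm. The data frame `toy` and the values of `N` and `p` are assumptions for illustration.

```r
set.seed(123)

# Made-up imbalanced data: 80 retained customers, 20 churners
toy <- data.frame(x = rnorm(100), Exited = rep(c(0, 1), times = c(80, 20)))

N <- 100   # desired size of the balanced sample
p <- 0.5   # target probability of the minority class

# Decide how many minority rows the new sample should contain
n_minority <- rbinom(1, N, p)

# Oversample the minority class with replacement
idx_min <- sample(which(toy$Exited == 1), n_minority, replace = TRUE)

# Undersample the majority class without replacement
idx_maj <- sample(which(toy$Exited == 0), N - n_minority, replace = FALSE)

balanced <- toy[c(idx_maj, idx_min), ]
table(balanced$Exited)
```

Because oversampling repeats minority rows, balancing is applied to the training data only, as in the analysis above, so that the test set keeps its original class mix.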

Scaling Data

Code
# Select numerical columns for normalization
numerical_cols <- c("CreditScore", "Age", "Tenure", "Balance", "NumOfProducts", "EstimatedSalary")

# Compute the mean and standard deviation from the training set
means <- apply(train_balanced[numerical_cols], 2, mean, na.rm = TRUE)
sds <- apply(train_balanced[numerical_cols], 2, sd, na.rm = TRUE)

# Normalize the training set
train_balanced[numerical_cols] <- sweep(train_balanced[numerical_cols], 2, means, "-")
train_balanced[numerical_cols] <- sweep(train_balanced[numerical_cols], 2, sds, "/")

# Normalize the test set using the same parameters
test[numerical_cols] <- sweep(test[numerical_cols], 2, means, "-")
test[numerical_cols] <- sweep(test[numerical_cols], 2, sds, "/")

Statistical Modelling

In this modeling phase, two machine learning models are trained for a classification task using the caret package in R: a Support Vector Machine (SVM) with a radial basis function kernel and a Random Forest model. A third model, XGBoost, is set up in the code but commented out, so it is not trained or evaluated.

The target variable Exited is converted to a factor to ensure that it is treated as a categorical variable for classification. A consistent seed (set.seed(123)) is set before training each model to ensure reproducibility of the results.

The trainControl function is used to set up the training control parameters, specifying 5-fold cross-validation (method = "cv") to assess the performance of the models.

For the SVM model, data preprocessing steps (preProcess = c("center", "scale")) are included to center and scale the features, which is often necessary for SVM models to perform well.

Each model is trained on the train_balanced dataset using the train function from the caret package, with the model type specified by the method parameter ("svmRadial" for SVM and "rf" for Random Forest; the commented-out XGBoost call uses "xgbTree").

The trained models are stored in a list named models for easy access and further evaluation. This modular approach allows for straightforward comparison of the performance of different models on the same dataset.
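To illustrate what 5-fold cross-validation does with the training rows, here is a minimal base-R sketch of fold assignment. caret builds its folds internally, so this is only an illustration of the idea with a hypothetical 20-row dataset; the variable names are assumptions.

```r
set.seed(123)

n <- 20  # hypothetical number of training rows
k <- 5   # number of folds, matching trainControl(number = 5)

# Assign each row to one of k folds at random, keeping fold sizes equal
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once while the other k - 1 folds are used for fitting
for (fold in seq_len(k)) {
  validation_rows <- which(fold_id == fold)
  training_rows   <- which(fold_id != fold)
  cat("Fold", fold, ": train on", length(training_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

The cross-validated performance reported by `train` is the average over the k held-out folds, which is what caret uses to choose among candidate tuning parameters.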

Code
train_control <- trainControl(method = "cv", number = 5)

# Set up a list to store models
models <- list()

# Train an SVM model
train_balanced$Exited <- as.factor(train_balanced$Exited)
set.seed(123)
models$svm <- train(Exited ~ ., 
                    data = train_balanced, 
                    method = "svmRadial",  # Radial basis function kernel
                    trControl = train_control,
                    preProcess = c("center", "scale"))

# Train a Random Forest model
set.seed(123)
models$random_forest <- train(Exited ~ ., 
                               data = train_balanced, 
                               method = "rf",  # Random Forest
                               trControl = train_control)

# Train an XGBoost model
#set.seed(123)
#models$xgboost <- suppressWarnings(train(Exited ~ ., 
                        #data = train_balanced, 
                        #method = "xgbTree",  # XGBoost
                        #trControl = train_control))
Code
for (model_name in names(models)) {
    set.seed(123)
  
    cat("****** Model:", model_name, " ******\n")
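    # Make predictions on the test features (target column removed)
    test_data <- test[, -which(names(test) == "Exited")]
    predictions <- predict(models[[model_name]], test_data)
    predictions_factor <- factor(predictions, levels = c("0", "1"))
    exited_factor <- factor(test$Exited, levels = c("0", "1"))
    # Class probabilities are computed here but are not used in the AUC below
    probabilities <- predict(models[[model_name]], test_data, type = "prob")[,2]
    # confusionMatrix treats the first level ("0", not churned) as the positive class
    confusion_matrix <- confusionMatrix(predictions_factor, exited_factor)
    accuracy <- confusion_matrix$overall['Accuracy']
    f1_score <- confusion_matrix$byClass['F1']
    
    # Calculate AUC-ROC from the predicted class labels (a single threshold),
    # so the value equals the balanced accuracy reported by confusionMatrix
    roc_curve <- roc(response = test$Exited, predictor = as.numeric(predictions))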
    auc_roc <- auc(roc_curve)
    
    # Print metrics
    print(confusion_matrix)
    cat("AUC-ROC:", auc_roc, "\n")
    
    # Plot AUC-ROC curve
    #plot(roc_curve, main = paste("AUC-ROC Curve for", model_name))
    
    cat("\n")
}
****** Model: svm  ******
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2597  160
         1  606  637
                                         
               Accuracy : 0.8085         
                 95% CI : (0.796, 0.8206)
    No Information Rate : 0.8008         
    P-Value [Acc > NIR] : 0.1133         
                                         
                  Kappa : 0.5041         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.8108         
            Specificity : 0.7992         
         Pos Pred Value : 0.9420         
         Neg Pred Value : 0.5125         
             Prevalence : 0.8007         
         Detection Rate : 0.6492         
   Detection Prevalence : 0.6893         
      Balanced Accuracy : 0.8050         
                                         
       'Positive' Class : 0              
                                         
AUC-ROC: 0.8050248 

****** Model: random_forest  ******
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2752  255
         1  451  542
                                          
               Accuracy : 0.8235          
                 95% CI : (0.8113, 0.8352)
    No Information Rate : 0.8008          
    P-Value [Acc > NIR] : 0.0001405       
                                          
                  Kappa : 0.4936          
                                          
 Mcnemar's Test P-Value : 2.153e-13       
                                          
            Sensitivity : 0.8592          
            Specificity : 0.6801          
         Pos Pred Value : 0.9152          
         Neg Pred Value : 0.5458          
             Prevalence : 0.8007          
         Detection Rate : 0.6880          
   Detection Prevalence : 0.7518          
      Balanced Accuracy : 0.7696          
                                          
       'Positive' Class : 0               
                                          
AUC-ROC: 0.7696223 
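The statistics printed above follow directly from the four cells of each confusion matrix. As a worked check, the sketch below recomputes them for the SVM from the printed counts, with class 0 as the positive class as stated in the output. The variable names are illustrative.

```r
# SVM confusion matrix counts from the output above (positive class = 0)
tp <- 2597  # predicted 0, actually 0
fn <- 606   # predicted 1, actually 0
fp <- 160   # predicted 0, actually 1
tn <- 637   # predicted 1, actually 1

sensitivity <- tp / (tp + fn)   # share of class 0 correctly identified
specificity <- tn / (tn + fp)   # share of class 1 (churners) correctly identified
ppv         <- tp / (tp + fp)   # predicted 0 that really are 0
npv         <- tn / (tn + fn)   # predicted 1 that really are 1
accuracy    <- (tp + tn) / (tp + fn + fp + tn)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        balanced_accuracy = balanced_accuracy), 4)
```

The balanced accuracy of about 0.8050 matches the printed AUC-ROC for the SVM. That is expected here, because an ROC curve built from hard class predictions has a single operating point, and its area is the mean of sensitivity and specificity.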

Conclusion

| Metric | SVM | Random Forest |
|---|---|---|
| Accuracy | 80.85% | 82.35% |
| Kappa | 0.5041 | 0.4936 |
| Sensitivity | 81.08% | 85.92% |
| Specificity | 79.92% | 68.01% |
| Positive Predictive Value | 94.20% | 91.52% |
| Negative Predictive Value | 51.25% | 54.58% |
| Balanced Accuracy | 80.50% | 76.96% |
| AUC-ROC | 0.805 | 0.770 |

In this analysis of bank customer churn prediction, the Support Vector Machine (SVM) model has shown promising results. Because the confusion matrices treat class 0 (not churned) as the positive class, sensitivity is the share of retained customers correctly identified, and specificity is the share of churners correctly identified. On that reading, the SVM detects 79.92% of churners against 68.01% for the Random Forest, and it has the higher kappa (0.5041) and balanced accuracy (80.50%). Its positive predictive value of 94.20% means that customers predicted to stay almost always did stay. The SVM also has the higher AUC-ROC (0.805), although this figure was computed from predicted class labels rather than probabilities, so it equals the balanced accuracy and does not summarize performance across decision thresholds.

Although the Random Forest model exhibited the higher overall accuracy (82.35%) and sensitivity (85.92%), its lower specificity means it misses more actual churners (255 versus 160 in the test set), so at-risk customers are more likely to go without a retention offer. The trade-off is that the SVM flags more loyal customers as likely churners (606 versus 451), and only about half of its churn predictions are correct (negative predictive value of 51.25%).
